Abstract


The multilayer neural network architecture has seen relatively little
number-theoretic research. However, extending the fundamental computing
capacity of networks to a larger set of possible numbers and functions
could improve machine learning results. Networks often use real and
sometimes complex numbers, and extending the numbers to contain a richer
structure may improve the abstract processing of real-world data. A
polynomial ring generalizes multiplication to convolution while keeping
the neural network as close as possible to the standard one. Experiments
using a synthetic problem compare the performance of real-valued standard
and polynomial ring implementations.
1. Introduction


Number-theoretic research on neural networks has not been very common,
although more complex computations and functions could improve
optimization results. Such extensions do not require significant
modifications to the standard architectures (fully connected, recurrent)
or core algorithms (backpropagation, gradient descent, activation
functions) [1]. In research, neural networks have typically been extended
only to process complex numbers or sometimes quaternions [2, 3, 4, 5].


Email address: tomas.ukkonen@novelinsight.fi (Tomas Ukkonen)
URL: www.novelinsight.fi (Tomas Ukkonen)


Preprint submitted to Cognitive Systems Research January 2, 2023 


2. Theoretical background 


The motivation of this paper for extending real numbers to a polynomial
ring is the intuition that real numbers can be extended to contain
dimensionality information using hyperreal numbers, non-standard
analysis, and fractal geometry [6, 7, 8]. A quantity of dimension d can be
written as the product of a real-valued scalar a and the volume $r^d$ of a
d-dimensional hypercube, where r is the arbitrary (unit) length of one
side of the hypercube. One can then give a somewhat simple definition of
“multidimensional” numbers s.


$$ s = a_0 r^0 + a_1 r^1 + a_2 r^2 + \dots \qquad (1) $$


Compared to the magnitude of real numbers, the length r is infinite and
outside of the set R, so the components point in the same direction but
are perpendicular to each other in the vector space sense, because there
is no scalar $a \in \mathbb{R}$ that scales one component into another. It
seems possible to extend this line of reasoning formally to cases where d
is real-valued, meaning fractal dimensions, or abstractly to imaginary and
negative dimensions ($r^{-1}$ is the infinitesimal dr) and, in addition to
multiplication, to define a stretching operation c on an object:
$a_{new} r^d = a_{old} (c r)^d = a_{old} c^d r^d$. Incorporating
dimensional information into a neural network might allow it to process
real-world data better at an abstract level.

To obtain a well-defined, closed number system that can be used in
computer calculations, dimensions are restricted to the natural numbers N,
and a real polynomial ring $\mathbb{R}[X]/(X^K - 1)$ [9] is defined as
follows (Equation 2).


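$$ s = \sum_{d=0}^{K-1} a_d r^d, \quad a_d \in \mathbb{R},\ d \in \mathbb{N} $$

$$ s_1 + s_2 = \sum_{d=0}^{K-1} (a_{1,d} + a_{2,d}) r^d \qquad (2) $$

$$ s_1 * s_2 = \sum_{d_1=0}^{K-1} \sum_{d_2=0}^{K-1} a_{1,d_1} a_{2,d_2} r^{(d_1 + d_2) \bmod K} $$

Here $a_{1,d}$ and $a_{2,d}$ denote the coefficients of $s_1$ and $s_2$.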
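As a concrete illustration of Equation 2, the following minimal Python sketch stores a number s as the list of its K coefficients and implements addition and multiplication. The function names `ring_add` and `ring_mul` and the choice K = 5 are illustrative assumptions and are not taken from the implementation described in this paper.

```python
# Minimal sketch of arithmetic in R[X]/(X^K - 1); all names are illustrative.
K = 5  # number of components (a small prime chosen only for this example)

def ring_add(s1, s2):
    """Component-wise sum of two coefficient lists of length K."""
    return [a + b for a, b in zip(s1, s2)]

def ring_mul(s1, s2):
    """Product with exponents reduced modulo K (a circular convolution)."""
    out = [0.0] * K
    for d1, a in enumerate(s1):
        for d2, b in enumerate(s2):
            out[(d1 + d2) % K] += a * b
    return out

one = [1.0, 0.0, 0.0, 0.0, 0.0]  # 1 * r^0, the multiplicative identity
r1 = [0.0, 1.0, 0.0, 0.0, 0.0]   # r^1
r4 = [0.0, 0.0, 0.0, 0.0, 1.0]   # r^4
s = [2.0, -1.0, 0.5, 0.0, 3.0]

print(ring_mul(s, one))  # same coefficients as s
print(ring_mul(r1, r4))  # r^5 wraps around to r^0
print(ring_mul(s, r1))   # coefficients of s rotated by one position
```

Multiplying by r^1 rotates the coefficients by one position, which is the wrap-around behaviour that the modulo operation in the exponent introduces.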


By choosing the modulus K to be a prime number, it is easy to see that
the polynomial ring becomes a field with a finite number K of components.
The multiplication operation now leads to a circular convolution of the
coefficients. The circular convolution and its inverse, the division
operation, can be calculated efficiently using the discrete Fourier
transform [10]. By making the dimensions d circular, the comparability of
the numbers is lost even when using real coefficients. However, if the
numbers in the circular convolution are properly zero-padded, it is often
possible to calculate the standard convolution. The circular convolution
operation processes each component symmetrically, ignoring dimensional
information. An asymmetry between the dimensions is later (weakly) imposed
by applying the stretching operation when initializing the neural
network’s weights so that the weights are approximately zero-padded.
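The relation between circular convolution, the discrete Fourier transform, and zero padding can be sketched as follows. This is a minimal illustration using a naive O(n^2) transform from the Python standard library. The helper names (`dft`, `circ_conv`, `circ_conv_dft`, `circ_div_dft`) and the example values are assumptions made for this sketch, and the division helper assumes that no Fourier coefficient of the divisor is zero.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse if requested)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circ_conv(a, b):
    """Circular convolution computed directly from the definition."""
    n = len(a)
    out = [0.0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] += a[i] * b[j]
    return out

def circ_conv_dft(a, b):
    """Same product via element-wise multiplication in the Fourier domain."""
    prod = [x * y for x, y in zip(dft(a), dft(b))]
    return [v.real for v in dft(prod, inverse=True)]

def circ_div_dft(a, b):
    """Division as element-wise division in the Fourier domain.
    Assumes every Fourier coefficient of b is non-zero."""
    quot = [x / y for x, y in zip(dft(a), dft(b))]
    return [v.real for v in dft(quot, inverse=True)]

# K = 7 components; both operands are zero-padded (3 + 2 - 1 <= 7).
a = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0]
b = [4.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0]
p = circ_conv(a, b)          # no wrap-around: equals the standard convolution
p_fast = circ_conv_dft(a, b)
a_back = circ_div_dft(p, b)  # recovers a up to rounding error
print(p)
```

Because the non-zero coefficients of both operands fit into the first components, no term wraps around, and the result matches the standard convolution of (1, 2, 3) and (4, 5).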


3. Neural network architectures using polynomial ring numbers 


3.1. Linear model 


The simplicity of linear optimization is that the function y = A * x + b
has only one easily solvable MSE optimum, which is the global optimum of
the problem. Linear functions are often too simple for practical
problems, but if real numbers can be given a non-linear number-theoretic
extension that fulfills the field axioms, it becomes possible to solve the
global optimum of non-linear problems. Unfortunately, for polynomial field
numbers s and real-valued data, the non-real parts of the coefficients are
always zero, meaning that it is impossible to extend the calculations
non-linearly.

In data analytics, however, it is common to discretize real-valued variables 
using one-hot encoding, which maps real numbers to higher dimensional vec- 
tors, after which the global optimum of an approximated problem is solvable. 
In this paper, other number theoretic possibilities are not studied. A multi- 
layer neural network is used instead, in which non-linearities make it possible 
to process real-valued data using extended number systems. 
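The discretization mentioned above can be sketched as follows. This is a minimal example, assuming equal-width bins over a fixed range. The function names, the range, and the data are hypothetical. For a single discretized input without a bias term, the least-squares optimum of the linear model on the one-hot features is the mean target value of each bin, which is why the approximated problem has an easily solvable global optimum.

```python
def one_hot_bin(x, lo, hi, bins):
    """Map a real value to a one-hot vector over equal-width bins on [lo, hi)."""
    idx = int((x - lo) / (hi - lo) * bins)
    idx = min(max(idx, 0), bins - 1)  # clamp values outside the range
    vec = [0.0] * bins
    vec[idx] = 1.0
    return vec

def fit_bin_means(xs, ys, lo, hi, bins):
    """Least-squares weights of y = w . one_hot(x): the mean target per bin.
    Empty bins get weight 0.0 (an arbitrary choice for this sketch)."""
    sums = [0.0] * bins
    counts = [0] * bins
    for x, y in zip(xs, ys):
        i = one_hot_bin(x, lo, hi, bins).index(1.0)
        sums[i] += y
        counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

xs = [-0.9, -0.8, 0.1, 0.2, 0.9]  # hypothetical inputs
ys = [1.0, 3.0, 0.0, 4.0, 5.0]    # hypothetical targets
weights = fit_bin_means(xs, ys, lo=-1.0, hi=1.0, bins=4)
print(one_hot_bin(0.3, -1.0, 1.0, 4))
print(weights)
```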


3.2. Neural network model 


The neural network was kept as close to linear as possible to keep
optimization methods like gradient descent functional. The layers of the
densely connected neural network use a leaky ReLU non-linearity [11, 12]
except for the last layer, which is entirely linear. The ReLU activation
function is extended to polynomial fields by applying the ReLU
non-linearity f(s) only to the first dimension and leaving the other
components unchanged. For polynomial field numbers s, a heuristic
derivative of ReLU is used (Equation 3), which gives much better
optimization results than calculating the derivative directly from the
definition.


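$$ f(s)_j = \begin{cases} \max(\alpha s_j, s_j), & \text{if } j = 0, \\ s_j, & \text{if } j \neq 0, \end{cases} \qquad \frac{df}{ds} = \begin{cases} \alpha, & \text{if } s_0 < 0, \\ 1, & \text{if } s_0 > 0. \end{cases} \qquad (3) $$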
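The activation and its heuristic derivative can be sketched in Python as follows. The names `leaky_relu_ring` and `leaky_relu_ring_derivative` are illustrative. The leak slope 0.01 is a hypothetical default, because the value used in the experiments is not stated here. Treating the case where the first component is exactly zero as having derivative 1 is also an assumption of this sketch.

```python
ALPHA = 0.01  # leak slope; hypothetical value, not taken from the paper

def leaky_relu_ring(s, alpha=ALPHA):
    """Apply leaky ReLU to the first component only; copy the others."""
    out = list(s)
    out[0] = max(alpha * s[0], s[0])
    return out

def leaky_relu_ring_derivative(s, alpha=ALPHA):
    """Heuristic derivative: one real factor chosen from the sign of s[0].
    The value at s[0] == 0 is set to 1.0 by assumption."""
    return alpha if s[0] < 0 else 1.0

negative = [-2.0, 3.0, -4.0]
positive = [2.0, -1.0, 0.5]
print(leaky_relu_ring(negative, alpha=0.1))
print(leaky_relu_ring_derivative(negative, alpha=0.1))
print(leaky_relu_ring(positive, alpha=0.1))
print(leaky_relu_ring_derivative(positive, alpha=0.1))
```

Only the first component is changed by the activation, so negative values in the other components pass through unmodified.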


Derivatives are well-defined for linear algebra operations when using
polynomial field numbers s, because they do not depend on the direction
from which zero is approached.


$$ \frac{df(s)}{ds} = \lim_{h \to 0} \frac{(a * (s + h) + b) - (a * s + b)}{h} = \lim_{h \to 0} a = a \qquad (4) $$
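The direction independence can be checked numerically. The sketch below is an illustration under stated assumptions: K = 5, the coefficient values, and the helper names are hypothetical. The small increment h is taken along each component direction in turn, written as eps * r^k, whose multiplicative inverse is (1/eps) * r^(K-k) because r^K = 1.

```python
K = 5  # number of components; chosen only for this example

def add(x, y):
    return [p + q for p, q in zip(x, y)]

def sub(x, y):
    return [p - q for p, q in zip(x, y)]

def mul(x, y):
    """Ring product: circular convolution of the coefficients."""
    out = [0.0] * K
    for i, p in enumerate(x):
        for j, q in enumerate(y):
            out[(i + j) % K] += p * q
    return out

def monomial(scale, k):
    """The number scale * r^k, with the exponent taken modulo K."""
    out = [0.0] * K
    out[k % K] = scale
    return out

a = [0.5, -1.0, 2.0, 0.0, 0.25]  # hypothetical coefficients
b = [1.0, 1.0, 0.0, -3.0, 0.0]
s = [0.3, 0.7, -0.2, 1.5, 0.0]

def f(x):
    return add(mul(a, x), b)     # the linear function f(s) = a * s + b

quotients = []
for k in range(K):
    h = monomial(1e-3, k)        # small step along component k
    h_inv = monomial(1e3, K - k) # inverse of h in the ring
    quotients.append(mul(sub(f(add(s, h)), f(s)), h_inv))

for q in quotients:
    print([round(v, 6) for v in q])
```

Each difference quotient equals the coefficient list of a up to rounding error, regardless of which component direction h is taken along.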


To calculate the derivative of the mean squared error using the
polynomial field, the MSE is written using an operator ⊛, which multiplies
polynomial field numbers component by component.

We can then write the MSE as a ⊛ product of $1 = 1 * r^0$, which selects
the real part of the error term, and a ⊛ product of the error delta
terms, which squares the error values. For simplicity, Equation 6 is
written using dimensions j, but it can be generalized to vectors and its
derivative calculated from the Jacobian matrix.


$$ MSE(w) = 1 \circledast \left( \frac{1}{2} \sum_j (y_j(w) - y_j) \circledast (y_j(w) - y_j) \right) \qquad (6) $$
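How the component-wise product selects the real part of the squared error can be sketched as follows. The helper names and the example values are hypothetical, and any constant normalization factor of the error is deliberately left out of this sketch.

```python
def comp_mul(x, y):
    """Component-wise product of two coefficient lists."""
    return [p * q for p, q in zip(x, y)]

def real_selector(K):
    """The number 1 = 1 * r^0 as a coefficient list of length K."""
    return [1.0] + [0.0] * (K - 1)

def squared_error_real_part(outputs, targets):
    """Sum over outputs j of delta_j squared component-wise, real part only."""
    K = len(outputs[0])
    total = [0.0] * K
    for y, t in zip(outputs, targets):
        delta = [p - q for p, q in zip(y, t)]
        squared = comp_mul(delta, delta)
        total = [u + v for u, v in zip(total, squared)]
    return comp_mul(real_selector(K), total)

outputs = [[1.0, 2.0, 0.0], [0.5, 0.0, 1.0]]  # two outputs, K = 3
targets = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
print(squared_error_real_part(outputs, targets))
```

The non-real components of the squared deltas are multiplied by zero, so only the first component contributes to the error.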


Now, because of the close resemblance of our polynomial ring to the
Fourier transform, it is possible to transform ⊛ products into
convolutions by calculating the Fourier transform $\mathcal{F}$ of the
terms. These convolutions and Fourier transforms are easy to
differentiate because they do not depend on the neural network’s weight
parameters w. In the gradient equations (Equation 7), the ∘ operator
denotes the circular convolution operation.
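The correspondence between component-wise products and circular convolutions in the Fourier domain can be verified numerically. The sketch below uses a naive unnormalized discrete Fourier transform. With that convention, the transform of a component-wise product equals the circular convolution of the transforms divided by K, and the constant differs for other normalizations. The helper names and example values are assumptions of this sketch.

```python
import cmath

def dft(x):
    """Naive unnormalized discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def circ_conv(a, b):
    """Circular convolution of two (possibly complex) sequences."""
    n = len(a)
    out = [0j] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] += a[i] * b[j]
    return out

K = 5
x = [2.0, -1.0, 0.5, 0.0, 3.0]  # hypothetical coefficient lists
y = [1.0, 4.0, -2.0, 0.5, 1.5]

# Transform of the component-wise product ...
lhs = dft([p * q for p, q in zip(x, y)])
# ... equals the circular convolution of the transforms, scaled by 1/K.
rhs = [v / K for v in circ_conv(dft(x), dft(y))]

max_difference = max(abs(u - v) for u, v in zip(lhs, rhs))
print(max_difference)
```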


$$ MSE(w) = 1 \circledast MSE'(w) $$
In the experiments, the non-real parts of the error vector are always set
to zero. The idea that stretching an object should not change its
properties is used to apply the stretching operation c = 0.25 to the
initial random weights. This means that the higher-dimensional weight
components are initially close to zero, and the neural network processes
data in low dimensions. Computations therefore start with the convolution
operations between numbers effectively zero-padded, but the network is
allowed to use the circular convolution and higher dimensions if doing so
reduces the MSE.
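A minimal sketch of such a stretched initialization is shown below. The stretching factor c = 0.25 and the 11 components follow the text and Table 1. The Gaussian distribution, its standard deviation, and the function names are assumptions, since the distribution of the initial random weights is not specified here.

```python
import random

def stretch(s, c):
    """Apply the stretching operation: component d is scaled by c**d."""
    return [a * c ** d for d, a in enumerate(s)]

def init_weight(rng, K, c=0.25, sigma=1.0):
    """Random weight with Gaussian components (an assumption), then stretched."""
    return stretch([rng.gauss(0.0, sigma) for _ in range(K)], c)

rng = random.Random(0)              # fixed seed for reproducibility
w = init_weight(rng, K=11)          # one weight with 11 components
scales = stretch([1.0] * 11, 0.25)  # the scale factor of each component
print(scales[:4])
print(scales[10])
```

With c = 0.25, the tenth component is scaled by 0.25^10, which is below 1e-6, so the weight is close to a real number at the start of training.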


3.3. Gradient descent algorithm


Algorithm 1 describes a modified gradient descent algorithm that trains
polynomial ring neural networks. It uses a costly adaptive step length
and sometimes moves to a worse solution. This mechanism allows the
algorithm to escape from the local minima in which polynomial ring
minimization often gets stuck. The algorithm rarely converges but instead
tries worse solutions, looking for a way out of the local minimum.
Therefore, execution of the algorithm was stopped after 500-2000
iterations when there were no more improvements, or when the execution
time reached 2.5-48 hours.
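The control flow of Algorithm 1 can be sketched in Python on a toy problem with real-valued weights. This is an illustration under several assumptions and not the implementation used in the experiments. The sketch assumes that `calculate_gradient` returns the gradient of the error, so the step subtracts it, whereas Algorithm 1 writes the update with a plus sign. The infinite start value is left out of the 20-value average, an outer iteration limit `max_outer` is added, and the quadratic toy error, `seed`, and all default values other than 0.02, 500, 20, and 40 are hypothetical.

```python
import random

def gradient_descent(w, calculate_error, calculate_gradient,
                     error_limit=1e-6, max_outer=2000, seed=0):
    rng = random.Random(seed)
    inf = float("inf")
    e = inf
    outer = 0
    while e > error_limit and outer < max_outer:
        outer += 1
        e = calculate_error(w)
        g = calculate_gradient(w)
        errors = [inf]
        epsilon = 0.02
        iters = 0
        w_next = w
        # Halve the step until the error no longer exceeds the current error.
        while errors[-1] > e and iters < 500:
            epsilon /= 2
            w_next = [wi - epsilon * gi for wi, gi in zip(w, g)]
            iters += 1
            errors.append(calculate_error(w_next))
            # Average of the last 20 finite errors (never empty here,
            # assuming calculate_error returns finite values).
            recent = [x for x in errors[-20:] if x != inf]
            average = sum(recent) / len(recent)
            # With probability 1/40 = 2.5%, force the loop to stop so that
            # w_next is accepted even if it is a worse solution.
            if errors[-1] < 2 * average and rng.randrange(40) == 0:
                errors.append(0.0)
        w = w_next
    return w

target = [1.0, -2.0]  # hypothetical optimum of the toy problem

def toy_error(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def toy_gradient(w):
    return [2 * (wi - ti) for wi, ti in zip(w, target)]

w_final = gradient_descent([0.0, 0.0], toy_error, toy_gradient)
print(toy_error(w_final))
```

On this convex toy problem every first step already reduces the error, so the random acceptance of worse solutions never changes the result. It matters only for error surfaces with local minima.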


4. Experimental results 


4.1. Experiment 1 


The first experiment uses a synthesized problem and a small number of
parameters. The amount of synthetic data is only N = 1000, so that even
our unoptimized algorithm can run the experiment in 24 hours (and
similarly in Experiment 2). The training did not use early stopping and
overfitted to the data. The standard neural network implementation is
TensorFlow with the Adam optimizer, which is compared against our
gradient descent algorithm. The non-linearities of the first layers of
the neural network are ReLU activation functions, and the last layer is
linear.


Algorithm 1 Gradient descent algorithm for polynomial ring neural network

procedure GRADIENTDESCENT
    e ← ∞
    while e > error_limit do
        e ← calculate_error(w)
        g ← calculate_gradient(w)
        errors ← ∞
        epsilon ← 0.02
        iters ← 0
        while last(errors) > e & iters < 500 do
            epsilon ← epsilon / 2
            w_next ← w + epsilon * g
            iters ← iters + 1
            errors ← calculate_error(w_next)
            r = (random() % 40)
            if last(errors) < 2 * average(errors, 20) & r == 0 then
                errors ← 0    (goes to a worse solution with a 2.5% probability)
        w ← w_next


Here it is interesting how a neural network learns simple non-linear
tasks. The input variables $x \sim N(0, 4^2 I)$ are mapped to values y
using the following equations.


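$$ y_1(x) = \sin(f * x_1 * x_2 * x_3 * x_4) $$

$$ y_2(x) = \mathrm{sign}(x_4) * a^{x_1/(|x_3| + 1)} $$

$$ y_3(x) = \mathrm{sign}(\cos(w * x_1)) + \mathrm{sign}(x_2) + \mathrm{sign}(x_4) \qquad (8) $$

$$ y_4(x) = x_2/(|x_1| + 1) + x_3 * \sqrt{|x_4|} + |x_4 - x_1| $$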
$$ f = 10.0, \quad w = 10.0, \quad a = 1.1 $$


The results of the overfitting optimization are in Table 1. The results
were computed five times, and the best value was chosen. The problem was
too complicated for a small two-layer 4-4-4 neural network, and the
algorithms could not learn the dataset. However, even in this case, the
standard neural network implementation performed better on normal data
than the polynomial ring neural network. The difference remained with
larger neural networks, while the computing time to converge to a minimum
error was much longer: 5 minutes with the standard TensorFlow
implementation and 12-24 hours with the polynomial ring neural network.
This result shows that polynomial ring neural networks still require more
work on optimizing the non-linearity, the number of dimensions, the
initialization of weights, and the learning algorithm before they are
usable as general problem solvers in machine learning.


Architecture                        | Error
4-4-4 (real-valued)                 | 1.0245
4-4-4 (11 dimensional numbers)      | 1.10242
4-20-20-4 (real-valued)             | 0.4218
4-20-20-4 (11 dimensional numbers)  | 0.6310


Table 1: Minimum absolute error found with different neural network architectures. 
5. Discussion


In this paper, an exciting number-theoretic extension to neural networks
is described. The results imply that the extended neural network is able
to learn more complex functions with smaller errors. However, the
optimization is plagued by problems such as getting stuck in local
optima, which requires using a heuristic derivative in the non-linearity
and a modified gradient descent algorithm.

Although it has not been studied in this paper, the convolutional nature
of the number system could make it useful for real-world processing of
audio and image signals. Another line of further study would be to test
the effect of different non-linearities on the maximum possible network
depth. In our experiments, the depth was only a few layers, but the
current framework scales up to 40 layers using a residual neural network.
More work is also needed to test the polynomial ring neural network with
more complete problems.


6. Funding and further information 


This research has not received any funding or grants from any sources. 


The polynomial ring neural network implementation will later be published
as open source as part of the Dinrhiw2 C++ machine learning software
library [15].